\section{Related Work} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:rw}

Language support for pattern matching was first introduced for string manipulation 
in COMIT\cite{COMIT58}, which subsequently inspired similar primitives in 
SNOBOL\cite{SNOBOL64}. SNOBOL4 had string patterns as first-class data types 
providing operations of concatenation and alternation\cite{SNOBOL71}.
The first reference to a modern pattern-matching constructs seen in functional 
languages is usually attributed to Burstall's work on structural 
induction\cite{Burstall69provingproperties}. Pattern matching was further 
developed by the functional programming community, most notably 
Hope\cite{BMS80}, Miranda\cite{Miranda85}, ML\cite{ML90} and 
Haskell\cite{Haskell98Book}. In the context of object-oriented programming, 
pattern matching has been first explored in Pizza\cite{Odersky97pizzainto} and 
Scala\cite{Scala2nd,EmirThesis}. The idea of first class patterns dates back to 
at least Tullsen's proposal to add them to Haskell~\cite{Tullsen00}, calculus of 
such patterns has been studied in details by Jay~\cite{Jay09,PatCalc09}.

There are two main approaches to compiling pattern-matching code: the first is 
based on \emph{backtracking automata} and was introduced by Augustsson\cite{Augustsson85}, 
the second is based on \emph{decision trees} and was first described by 
Cardelli\cite{Cardelli84}, though he attributes the technique to Dave MacQueen 
and Gilles Kahn in their implementation of the Hope compiler \cite{BMS80}.
Backtracking approach usually generates smaller code~\cite{OPM01}, while decision tree 
approach produces faster code by ensuring that each primitive test is only 
performed once~\cite{Maranget08}. With respect to matching a single expression our library 
approach follows the naive backtracking approach, however our match statement is 
based on a highly efficient type switching technique we developed\cite{TypeSwitch} 
that outperforms similar solutions based on decision trees or visitor design pattern.

Memoization device we proposed is not specifically concerned with compiling 
pattern matching and can be used independently. In particular it can be combined 
with either backtracking or decision tree approaches to avoid subsequent 
decisions on datum that has already been seen.

\emph{MatchO} was one of the first attempts to implement pattern matching without 
extending the host language~\cite{Visser06matchingobjects}. While the library 
was essentially providing first class patterns into Java, the amount of 
syntactic and run-time overhead was overwhelming. The author offers to provide 
customized syntax to patterns by invoking an inline parser, while to improve 
performance the author looks into combining visitors and visitor combinators 
with pattern matching. %This resembles our use of an efficient type switch to 
%uncover the dynamic type combined with slower but more flexible pattern 
%matching.

\emph{Functional C\#} was a similar approach to bringing pattern matching into 
C\# as a library\cite{FuncCSharp}. The approach uses lambda functions and 
chaining of method calls to create a structure that is then evaluated at 
run-time for the first successful match. The approach supports a form of 
active patterns, simple n+k patterns, list and tuple patterns as well as type 
patterns (without structural decomposition). Unfortunately, the solution scales 
very poorly for match statements with more than two case clauses as it is 
essentially based on sequential type tests. Besides, the approach is ill suited 
for tests involving nesting of patterns.

Both \emph{MatchO} and \emph{Functional C\#} were object-oriented solutions. In 
functional community, Rhiger explored a similar idea of introducing pattern 
matching into Haskell as a library~\cite{Rhiger09}. He uses functions to encode 
patterns and pattern combinators, which allows him to detect pattern 
misapplication errors at compile time through the Haskell type system. 
Unfortunately, the purity of a functional language prevents him from expressing 
variable patterns (more specifically, the bindings they introduce) in a library 
setting simply, which then affects the overal ellegance of the solution. In 
particular, it seems to be impossible in his approach to express equivalence 
patterns or even equivalence combinator as the value of the bound variable is 
not available in the left-hand side of the matching expression, only in the 
right-hand side. The library is sufficiently expressive though to express basic 
patterns (value, wildcard, predicate, etc.), boolean pattern combinators and set 
patterns. The author reports that his solution requires 2-4 times more 
reductions than Haskell's built-in pattern matching. He attributes the overhead 
to currying and continuations passing and hypothsizes that it should be completly 
eliminated by standard partial evaluation techniques.

\emph{Grace} is another programming language that provides a library solution to 
pattern matching through objects~\cite{Grace2012}. The language is aimed at 
teaching various programming paradigms and has an interesting peculiarity: it 
does not have pre-defined control structures, which instead are expressed 
directly in the language with the help of partial functions and lambda 
expressions. The \codeocaml{match}-statement, for example, takes form of a 
clausal definition \codeocaml{match(e)case(c1)...case(cn)}, specialized for 
large values of $n$. 
 
this with clausal function  
definitions i.e. if(c)then(b1)else(b2) with blocks (lambda expressions) passed 
as second and third parameter. They provide match(e)case(c1)...case(cn) for some 
large number n (cleanly) and have a workaround that hardcodes in the compiler 
support for arbitrary number of cases. Patterns are objects in Grace that are 
composed and applied at run-time. The compiler has some syntactic sugar to 
denote some most commone ones, but otherwise user-defined patterns have to be 
created explicitly. Grace has structural subtyping, not a nominative one, so 
their case analysis is never hierarchical like in our case. The approach does 
not seem to provide for optimizations.  I spoke with one of the authors, who 
presented the slides and he said that performance was indeed not a concern for 
them and only the expressiveness of the language and ease of its teaching to 
students was. The paper is not directly related to type switch, however it is 
related to the second pattern-matching paper. It also mentions some interesting 
works I haven't seen before that we will have to discuss in the second paper. 
%xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

\emph{Lookup caches} have been long used to reduce the overhead of 
dynamically-bound message passing in Smalltalk~\cite{UngarPatterson83}. 
\emph{Inline caching} improves on that by noting that the type of the receiver at a given 
call site rarely varies so that the call instruction can be speculatively 
modified to jump directly to a previously looked up method~\cite{Deutsch84}. 
In this case, the method must ensure that the type of the receiver has 
not changed and redirect the call to generic lookup otherwise. The effects of 
inline caching on modern architectures can be seen through hardware call target 
prediction, which in our case is exemplified by repetitive benchmark: both 
virtual function calls and the underlying jump-table implementation of the 
\code{Match}-statement are about twice as fast as usual.

\emph{Polymorphic Inline Caches}~\cite{Holzle:Chambers:Ungar:91} generalize the 
idea of inline caches further by building a decision tree in the method prologue 
that caches all lookup results. The main difference of this approach from our 
work is that it requires code generation at run time, while we do not require
re-compilation, re-linking or any computations in case of dynamic linking.
The reason for this is the difference in the initial setting: they map an
arbitrary number of receiver types to an arbitrary number of implementations, while 
we map an arbitrary number of receivers to a fixed number of jump 
targets. This lets us generate code at compile time that incorporates both the 
initial and memoized execution.

\emph{Extensible Visitors with Default Cases}~\cite[\textsection 
4.2]{Zenger:2001} attempt to solve the extensibility problem of visitors; 
however, the solution, after 
remapping it onto C++, has problems of its own. The visitation interface 
hierarchy can easily be grown linearly (adding new cases for the new classes in 
the original hierarchy each time), but independent extensions by different  
authorities require developer's intervention to unify them all, before they can 
be used together. This may not be feasible in environments that use dynamic 
linking. To avoid writing even more boilerplate code in new visitors, the 
solution would require usage of virtual inheritance, which typically has 
an overhead of extra memory dereferencing. On top of the double dispatch already 
present in the visitor pattern, the solution will incur two additional virtual 
calls and a dynamic cast for each level of visitor extension. Additional double 
dispatch is incurred by forwarding of default handling from a base visitor to a 
derived one, while the dynamic cast is required for safety and can be replaced 
with a static cast when the visitation interface is guaranteed to be grown linearly 
(extended by one authority only). Yet another virtual call is required to be 
able to forward computations to subcomponents on tree-like structures to the 
most derived visitor. This last function lets one avoid the necessity of using 
the heap to allocate a temporary visitor through the \emph{Factory Design 
Pattern}~\cite{DesignPatterns1993} used in the \emph{Extensible Visitor} solution 
originally proposed by Krishnamurti, Felleisen and Friedman~\cite{Krishnamurthi98}.

In order to address the expression problem in Haskell, L\"{o}h and Hinze proposed to 
extend its type system with open data types and open functions~\cite{LohHinze2006}.
Their solution allows the user to mark top-level data types and functions as 
open and then provide concrete variants and overloads anywhere in the program. 
Open data types are extensible but not hierarchical, which largely avoids the 
problems discussed here. The semantics of open extension is given by 
transformation into a single module, where all the definitions are seen in one 
place. This is a significant limitation of their approach that prevents it from 
being truly open, since it essentially assumes a whole-program view, which 
excludes any extension via DLLs. As is the case with many other implementations 
of open extensions, the authors rely on the closed world for efficient 
implementation: in their implementation, \emph{``data types can only be entirely 
abstract (not allowing pattern matching) or concrete with all constructors with 
the reason being that pattern matching can be compiled more efficiently if the 
layout of the data type is known completely''}. The authors also believe that 
\emph{there are no theoretical difficulties in lifting this restriction, but it 
might imply a small performance loss if closed functions pattern match on open 
data types}. Our work addresses exactly this problem, showing that it is not 
only theoretically possible but also practically efficient and in application to 
a broader problem.

Andrew Kennedy et al~\cite{GADTOOP05} considered encoding of generalized 
algebraic data types~\cite{SPJ06} using visitor design patterns in C\#.  That 
translation made essential use of generic methods, the equivalent of ``virtual 
function template'' (as they would be called if such functionalities existed in 
C++) to handle some of the open set aspects of GADTs.

Polymorphic variants in OCaml~\cite{garrigue-98} allow the addition of new variants 
later. They are simpler, however, than object-oriented extensions, as they do not 
form subtyping between variants themselves, but only between combinations of them. 
This makes an important distinction between \emph{extensible sum types} like 
polymorphic variants and \emph{extensible hierarchical sum types} like classes.
An important property of extensible sum types is that each value of the 
underlying algebraic data type belongs to exactly one disjoint subset, tagged with 
a constructor. The \emph{nominative subtyping} of object-oriented languages does 
not usually have this disjointness making classes effectively have multiple 
types. In particular, the case of disjoint constructors can be seen as a 
degenerated case of a flat class hierarchy among the multitude of possible class 
hierarchies.

Tom is a pattern-matching compiler that can be used together with Java, C or 
Eiffel to bring a common pattern matching and term rewriting syntax into the 
languages\cite{Moreau:2003}. It works as a preprocessor that transforms 
syntactic extensions into imperative code in the target language. Tom is quite 
transparent as to the concrete target language used and can potentially be 
extended to other target languages besides the three supported now. In 
particular, it never uses any semantic information of the target language during 
the compilation process and it does not inspect nor modify the source language 
part (their preprocessor is only aware of parenthesis and block delimiters of 
the source language). Tom has a sublanguage called Gom that can be used to 
define algebraic data types in a uniform mannaer, which their preprocessor then 
transforms into conrete definitions in the target language. Alternatively, the 
user can provide mappings to his own data structures that the preprocessor will 
use to generate the code.

In comparison to our approach Tom has much bigger goals. The combination of 
pattern matching, term rewriting and strategies turns Tom into a 
tree-transformation language similar to Stratego/XT, XDuce and others. 
The main accent is made on expressivity and the speed of development, which 
makes one often wonder about the run-time complexity of the generated code.
Tom's approach is also prone to general problems of any preprocessor based 
solution\cite[\textsection 4.3]{SELL}. For example, when several preprocessors 
have to be used together, each independent extension may not be able to 
understand the other's syntax, making it impossible to form a toolchain.
A library approach we follow avoids most of these problems by relying only on a 
standard C++ compiler. It also lets us employ the C++ semantics within 
patterns: e.g. our patterns work directly on underlying user-defined data 
structures, largely avoiding abstraction penalties. The tight integration with 
the language semantics also makes our patterns first-class citizens that can be 
composed and passed to other functions. The approach we take to type switching 
can also be used by Tom's preprocessor to implement type patterns efficiently -- 
similarly to other object-oriented languages, Tom's handling of them is based on 
highly inefficient \code{instanceof} operator and its equivalents in other 
languages.

Pattern matching in Scala~\cite{Scala2nd} also supports type switching through 
type patterns. The language supports extensible and hierarchical data types, but 
their handling in a type switching constructs varies. Sealed classes are handled 
with an efficient switch over all tags, since sealed classes cannot be extended. 
Classes that are not sealed are similarly approached with a combination of an 
\code{InstanceOf} operator and a decision tree~\cite{EmirThesis}.

%An example would be our generalized n+k patterns where we 
%can turn any invertible function even user defined into a pattern.

Prop is another language extension that brings pattern matching into 
\Cpp{}~\cite{Prop96}. This extension is not focused on pattern 
matching, but is intended for building high performance 
compiler and language transformation systems. It supports value-, variable-, 
wildcard-, constructor-, nested-, as-, type- and numerous sequence patterns.

Functional C\# is similar to our approach in trying to bring pattern matching 
into the C\# as a library\cite{FuncCSharp}. The approach uses lambda functions 
and chaining of method calls to create a structure that is then interpreted at 
run-time for the first successful predicate. The approach supports a form of 
active patterns, simple n+k patterns, list and tuple patterns as well as type 
patterns (without structural decomposition). 
However, an approach based on sequential type tests 
scales very poorly for match statements with more than two case clauses, making 
it unreasonably slower than the visitor design pattern~\cite{TypeSwitch}. Besides, the approach 
seems to be ill suited for tests involving nesting of patterns.

There has been previous attempts to use pattern matching with the Pivot 
framework that we used to experiment with our library. In his dissertation, 
Pirkelbauer devised a pattern language capable of representing various entities 
in a C++ program. The patterns were then translated with a tool into a set of 
visitors implementing the underlying pattern matching 
semantics\cite{PirkelbauerThesis}. Earlier, Cook et al used expression templates 
to implement a query language for Pivot's Internal Program Representation 
\cite{iql04}. While their work was built around a concrete class hierarchy 
letting them put some semantic knowledge about concrete classes into the 
The principal difference of their work from this work is that 
authors were essentially creating a pattern matcher for a given class hierarchy 
and thus could take the semantics of the entities represented by classes in the 
hierarchy into account. Our approach is parametrized over class hierarchy and 
thus provides a rather lower level pattern-matching functionality that lets one 
simplify work with that hierarchy.  One can think of it as a generalized 
dynamic\_cast.

Racket also includes a powerful enough macro system that allows it to express a 
solution to first class patterns in the language entirely as a 
library~\cite{Tobin-Hochstadt_2010}. The work builds on earlier attempts to 
bring pattern matching into Scheme, extending it and making it more efficient~\cite{Wright95}.

When the class hierarchy is fixed, one can design a pattern language that involves 
semantic notions represented by the hierarchy. Pirkelbauer devised a pattern 
language for Pivot capable of representing various entities in a \Cpp{} program using syntax very close to the \Cpp{} itself. 
Interestingly, the patterns were translated with a tool into a set of visitors 
implementing the underlying pattern-matching semantics\cite{PirkelbauerThesis}. 
Earlier, Cook et al used expression templates to implement a query language for 
Pivot's class hierarchy~\cite{iql04}. %Our current work is the result of a series 
%of experimental designs. The library approach was essential to provide 
%relatively quick turnaround for experiments and for maintaining and improving 
%performance for our applications.

Newspeak and Grace treat their case statement as combination of partial 
functions -- functions defined on a restricted domain of inputs. Functional C 
sharp does ...

Citation\cite{padl08}:
Very good reference that I have never seen somehow. They have some good usage 
examples that might be interesting to recreate in our library. It's not a 
library solution since they modify the language, but it has a lot of 
similarities to our work in goals. They are only concerned with composability in 
this work however as I'm sure their implementation as is has a lot of virtual 
function call overhead, which they plan to optimize in the future work.  

Another interesting thing about this paper is that they coin the term 
"deconstructor" that I was trying to coin for what we call "bindings", but which 
Dr. Stroustrup rejected as such that is commonly confused with destructor on the 
C++ discussion boards. Interestingly, Google did not show me this paper when I 
was searching word "deconstructor" on the internet back then. 

Robust Scripting via Patterns~\cite{Thorn2012}
 
The paper deals with bringing in pattern matching to a dynamically typed 
language, namely Thorn. Similarly to Grace, the patterns are first-class 
citizens. A class may introduce a regular constructor via new keyword and a 
desctructuring constructor via pat keyword. The comprehension mechanism of 
regular constructors can then be also used for destructuring constructor in 
patterns. The paper shows some usage statistics of various patterns in real code 
at IBM as well as contexts in which they are most oftenly used (Thorn allows 
patterns in places other than just match statement). As with Grace, the pattern 
matching is executed the naive way without any optimizations (Thorn is an 
interpreter) and they do not address the question of how optimization of 
user-defined patterns can be done generally - which is what would be of interest 
for a compiler implementation of pattern matching in C++. The related work and 
references don't add much to what we know already on the subject, the first 
paper is a better source in this respect. 

Citation\cite{Tobin-Hochstadt_2010}:
This is the paper that i think Xavier Leroy was talking about on Scheme's 
pattern matching library. Well, there is another similar work by Wright from 96 
that is cited elsewhere, but i'm not sure which of them Dr. Leroy was talking 
about and what is the relationship between these two papers.

Citation\cite{Visser06matchingobjects}:
We should probably quote it as another library approach, but the amount of 
syntactic and run-time overhead there is overwhelming, which is why i don't 
understand how can he possibly complain about our syntactic overhead.

Citation\cite{Holzle:Chambers:Ungar:91}:
There is indeed similarity between Polymorphic Inline Caches (PICs) of Holzle 
and our memoization approach, but that's a similarity of any memoization 
technique I guess, we should probably cite them in any case.

PICs generate code at run-time (function stub that does type testing with ifs) 
and require the source code of the entire program be available at run-time. Our 
approach, as reviewer 1 points out, neither needs re-compilation, nor re-linking 
nor even any computations at dynamic linking time. Our setting is quite 
different from theirs though - they are mapping arbitrary number of receiver 
types to arbitrary number of implementations of a specific dynamically looked up 
method, called at that call site. We are mapping an arbitrary number of 
receivers to a fixed amount of jump targets! We are thus able to generate code 
at compile time that can be executed two ways: sequentially and directly with a 
jump. The code executed sequentially memoizes its own target jumps for next 
execution as well as prepares any data that would be needed to re-establish any 
invariants guaranteed by sequential execution. The novelty of specifically 
memoization device is thus in a code layout that can memoize its own sequential 
execution. 

Also a small observation about Inline Caches (IC) that PICs are based on. 
Virtual function calls on modern architectures are essentially executed with IC, 
because of the hardware call target prediction. That can be easilty seen from 
our repetitive benchmark where calls on different objects of the same reciever 
type are twice faster than virtual calls on objects of different reciever types. 
Our repetitive benchmark was specifically designed to compare our performance 
for this specific case. Nevertheless we are still faster than visitors in this 
case. Holzle himself writes that inline caches are effective only if the 
receiver type (and thus the call target) remains relatively constant at a call 
site. 

Now, the PIC they design to overcome the IC problem does a 
sequence/decision-tree of type tests, which we show is much slower than visitors 
in Figure 2. Any decision-tree based implementation will suffer from numerous 
ifs and won't be able to compete with 2 indirect calls of visitor design 
pattern.  In contrast, our approach performs well regardless of whether the 
target type remains the same (our repetitive benchmark, monomorphic calls in 
their terms) or picked from a small group of equally possible reciever types 
(our random benchmark, their polymorphic calls case) or picked from a large 
group of different receiver types (our random benchmark, their megamorphic 
case). In PICs they are mentioning that specific PIC can be optimized by 
rearranging order of tests, making tests with decision trees or even a hashed 
approach. In our approach we use the compactness of vtbl-pointers that we 
observe and the adaptive caching function we build to achieve almost perfect 
caching.

Citation\cite{Clifton:2000:MMO:353171.353181}:
The cascading-if (or decision tree) implementation was only a starting point for 
our approach because it is open to class extensions. We then optimized it to be 
open and fast with the help of memoization device.  

As to multi-methods, we should probably remove any reference to multiple 
dispatch we have as it became a way for people to jump off the topic. 

Citation\cite{Chambers:1999:EMP:320384.320407}:
We refer to work of Ernst, Kaplan, and Chambers on unified theory of predicate 
dispatching, but should probably refer to this citation instead as it talks not 
only about the theory of predicate dispatch but also efficient implementation. 
The techniques discussed here however are orthogonal to ours: they decide how to 
layout the code best for any execution, while we can still apply memoization 
device on top of whatever they layed out to jump directly to the chosen branch 
on subsequent runs. The discussion they have in section 4 on approaches to 
implementing multiway-type-testing exactly stresses our point - they are 
efficient when Class IDs are sequential and they have to resort to their 
decision tree generation otherwise. We can probably refer to that in discussion 
of why efficient type switching is not straightforward (and by efficient we mean 
comparable to visitors).

Citation\cite{PQEncoding,Zibin:2002:FAC:582419.582434}:
We reference the work of Joseph Gil in section 4.1 of technical report. The 
paragraph got out at last moment because of the page limit cuttings. These works 
however are all falling into the category of type testing based implementation 
of the type switch, which we show is slower than visitors based on approach of 
Gibbs and Stroustrup, which has an overhead of a single integer division in our 
tests. 

Besides, most of the approaches discussed in that work are not open - they 
usually either assume a certain structure of class ids or whole program view 
(whole class hierarchy view).

Citation\cite{runabout}:
The Runabout approach of Grothoff is, quoting the author, "only 2-10 times 
slower than visitors", and thus is not efficient accordingly to our definition 
of efficient. The approach uses reflection mechanisms of Java to lockate the 
visit method, but then does optimization over the previous Walkabout approach in 
generating a verified byte-code at run-time to call visit method instead of 
calling it with reflection.

Citation\cite{Millstein:2002:MTH:581478.581489}:
I have been looking at the TOPLAS version of that work before and could not 
directly relate it to ours because the work gives syntax and semantics of 
features, but not how to efficiently implement those. I think the work is closer 
to our multi-methods work than to type switching.

The Related Work section should probably also discuss DYNAMIC languages that
use call-site caching. In particular, Objective-C comes to mind. Apple has
tweaked the core dispatch performance for many years and has a pretty fast

Mentioned here: http://cs.swan.ac.uk/~csdavec/libobjc/libobjc.pdf 
implementation at this point. Microsoft's Dynamic Language Runtime for the CLR
also has a call-site caching feature. It should be discussed and compared to.

Mentioned here:  http://msdn.microsoft.com/en-us/library/dd233052.aspx search for "call site"

A number of constant-time type inclusion techniques have been proposed. 
Binary Matrix generates very sparce quadratic table that will have to be updated 
at load time to account for loaded class extensions. Cohen's algorithm depends 
on unique identifiers assigned to each class and it doesn't support multiple 
inheritance. 
Hierarchical Encoding requires to see the entire class hierarchy and may incure 
prohibitively expensive computations at load time due to NP-hard problem that 
has to be solved to find optimal bit vector encoding. 
Relative Numbering used in Modula-3 doesn't have an obvious way of supporting 
multiple inheritance.

The advantage of our technique is that it localizes entire implementation to 
type switch only - no modifications of global compiler tables/virtual tables is 
needed. Unlike IC or PIC no changes to executable code is required either, which 
makes it attractive for todays OS often enforcing no changes to code.